THE ADABOOST FLOW 



A. LYKOV, S.MUZYCHKA AND K. VANINSKY 

Abstract. We introduce a dynamical system which we call the AdaBoost flow.
The flow is defined by a system of ODEs with control. We show how by a suitable
choice of control the AdaBoost algorithm of Schapire and Freund and the arc-gv
algorithm of Breiman can be embedded in the AdaBoost flow. We also show how
confidence rated prediction, previously studied by Schapire and Singer, can also
be obtained from our continuous time approach. We introduce a new continuous
time algorithm which we call SuperBoost and describe its properties.

The nontrivial part of the AdaBoost flow equations coincides with the equa- 
tions of dynamics of the nonperiodic Toda system written in terms of spectral 
variables. This establishes a connection between the two seemingly unrelated 
fields of boosting algorithms and exactly soluble models of classical mechanics. 
Finally we explain the similarity of the AdaBoost construction with Perelman's 
ideas to control the Ricci flow.

1. Introduction.

The AdaBoost algorithm needs no advertisement in the data mining community. It was discovered by Robert Schapire and Yoav Freund in their seminal paper of 1997, [1]. Nowadays, together with the PageRank algorithm, AdaBoost is considered among the top ten algorithms in data mining, pjj. It is worth mentioning that for their AdaBoost paper [1], Schapire and Freund won the Gödel Prize, one of the most prestigious awards in theoretical computer science, in 2003.

The AdaBoost algorithm arose from an abstract problem. In 1988, Kearns and Valiant asked whether a learning algorithm that performs just slightly better than random guessing could be boosted into an arbitrarily accurate learning algorithm. Schapire in 1990, [9], found that the answer is yes, and the proof he gave is a construction, which is the first boosting algorithm. AdaBoost, proposed in 1997, [1], has given rise to extensive research on theoretical aspects of ensemble methods, which can be easily found in the machine learning and statistical literature. From a practical viewpoint AdaBoost is used to construct spam filtering systems, search engines, face recognition and recommender systems, to name a few possibilities. A mathematician can forget about all that and treat AdaBoost as an algorithm which solves some special optimization problem.

In the present note we introduce a dynamical system which we call the AdaBoost flow. The flow is defined by a system of ODEs with control. We show how, by a suitable choice of control, the AdaBoost algorithm of Schapire and Freund and the arc-gv algorithm of Breiman can be embedded in the AdaBoost flow. We also show how the confidence rated prediction previously studied by Schapire and Singer, [10], can be obtained from our continuous time approach. We also introduce a new continuous time algorithm which we call SuperBoost.

We establish a connection between the AdaBoost flow and the classical nonperiodic Toda system of particles. Namely, the nontrivial part of the AdaBoost flow coincides with the dynamics of the nonperiodic Toda system written in terms of spectral variables. Introduced in 1967, [11], the Toda lattice is a basic example of a system of classical mechanics integrable in the Liouville sense. Its complete integrability was proved by Moser in [B]. The algebraic-geometrical approach to integrability was developed by Krichever and Vaninsky, [S]. The relation of the algebraic-geometrical approach to classical spectral theory was investigated in [15].

Finally we discuss the intriguing similarity of the AdaBoost algorithm with Perelman's ideas to control the Ricci flow. It turns out that all parts of the AdaBoost algorithm have their counterparts in Perelman's construction. We present a dictionary between the two problems.

We would like to thank Igor Krichever, Sasha Veselov and Vadim Malyshev for 
stimulating discussions. Our special thanks to Henry McKean for his numerous 
remarks.

2. Discrete time AdaBoost algorithm.

2.1. Basic AdaBoost algorithm. We are given a set of points

$$TS = \{(x_1,y_1),\dots,(x_m,y_m)\},$$

where $x_i \in X$ and $y_i \in \{-1,+1\}$. This set is called a training set, and the $y_i$ are called labels. We are also given a finite set of prescribed functions $\mathcal{H}_0 = \{h_\gamma,\ \gamma \in \Gamma\}$, where $h_\gamma : X \to \{-1,+1\}$ for each $\gamma \in \Gamma$. These functions are usually called weak classifiers. Pick some probability distribution $w(i)$, $i = 1,\dots,m$, on the points of the training set. The classification error of any weak classifier $h_\gamma$,

$$W^-(h_\gamma,w) = w\{i : h_\gamma(x_i)y_i = -1\},$$

can be quite big, i.e. $W^-(h_\gamma,w) \gg 0$. Therefore, we can try to combine these weak classifiers to obtain a better classification error.

Let $\mathcal{H}$ be a positive cone over the set of basic classifiers,

$$\mathcal{H} = \Big\{H : H = \sum_{\gamma} a_\gamma h_\gamma,\ h_\gamma \in \mathcal{H}_0,\ a_\gamma \ge 0\Big\}.$$

From any $H \in \mathcal{H}$, the combined classifier $\mathbf{H} : X \to \{-1,0,+1\}$ can be constructed. Namely, if $H(x) \ne 0$, then $\mathbf{H}(x) = \operatorname{sign} H(x)$; if $H(x) = 0$, then no decision can be made and $\mathbf{H}(x) = 0$. Let us define

$$W^-(H,w) = w\{i : \mathbf{H}(x_i)y_i = -1\}, \qquad W^0(H,w) = w\{i : \mathbf{H}(x_i)y_i = 0\}.$$

The problem is to minimize the error $W^- + W^0$ of the combined classifier $\mathbf{H}$ by choosing appropriate values of $a_\gamma$. The difficulty of this constrained minimization problem is that the error is almost everywhere constant on $\mathcal{H}$, so gradient methods cannot be applied directly.

The AdaBoost algorithm offers a candidate for the solution of this problem in a series of $N+1$ rounds, where $N$ is some whole number. For any $n = 0,1,\dots,N$ the combined classifier $\mathbf{H}_n : X \to \{-1,0,+1\}$ is defined from

$$H_n = t_0h_{\gamma_0} + \dots + t_nh_{\gamma_n},$$

where the sequence of positive numbers $t_0,t_1,\dots,t_n$ is constructed simultaneously with the $h$'s. The final classifier $H = H_N$ is a candidate for the solution of the minimization problem. In practical applications one simply chooses $N$ large enough. A theoretical bound for the misclassification error and a sufficient condition for an error free classifier will be given below.

Specifically, AdaBoost recursively constructs a family of classifiers by means of probability measures $w_0,w_1,\dots,w_N$. It starts with the fixed distribution $w$:

$$w_0 : \quad w_0(i) = w(i), \quad i = 1,\dots,m.$$

Given a distribution $w_n$, $n = 0,\dots,N$, the AdaBoost algorithm picks an arbitrary $h_{\gamma_n}$ from $\mathcal{H}_0$ such that

$$W_n^- = W^-(h_{\gamma_n},w_n) < 1/2. \tag{2.1}$$

If at some step it is not possible to do this, i.e. if

$$\min_\gamma W^-(h_\gamma,w_n) \ge 1/2,$$

then the procedure stops, unfinished. The reason for this will be explained later. Note that on each step the algorithm does not have to go through the whole set $\mathcal{H}_0$; one has just to find one classifier that satisfies (2.1). If this is the case and $W_n^- < 1/2$ for all $n = 0,1,\dots,N$, then the measure is constructed recurrently:

$$w_{n+1}(i) = \frac{w_n(i)\,e^{-t_ny_ih_{\gamma_n}(x_i)}}{Z_n},$$

where $t_n$ is some positive number and

$$Z_n = \sum_{i=1}^m w_n(i)\,e^{-t_ny_ih_{\gamma_n}(x_i)}.$$

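The whole procedure can be represented by a diagram

$$\begin{array}{ccccccc} H_0 & & H_1 & & \cdots & & H_N \\ \uparrow & \searrow & \uparrow & \searrow & \cdots & \searrow & \uparrow \\ w_0 & & w_1 & & \cdots & & w_N \end{array}$$

The function $H(x) = H_N(x)$ takes values in the segment $[-T,+T]$, where $T = \sum_{n=0}^N t_n$.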

At each step of the AdaBoost procedure the set of training points $TS$ falls into two categories, $G_n$ (good) and $B_n$ (bad). Points of $G_n$ are those that are classified correctly by $h_{\gamma_n}$:

$$G_n = \{(x_i,y_i) : h_{\gamma_n}(x_i)y_i = +1\}.$$

The measure $W_n^+ = w_n\{G_n\}$ of these points decreases upon the next step:

$$w_n(i) \to w_{n+1}(i) = \frac{e^{-t_n}}{Z_n}\,w_n(i), \qquad (x_i,y_i) \in G_n.$$

The points of $B_n$ are those that were misclassified by $h_{\gamma_n}$:

$$B_n = \{(x_i,y_i) : h_{\gamma_n}(x_i)y_i = -1\}.$$

The measure $W_n^- = w_n\{B_n\}$ of these points increases upon the next step:

$$w_n(i) \to w_{n+1}(i) = \frac{e^{t_n}}{Z_n}\,w_n(i), \qquad (x_i,y_i) \in B_n.$$

Apparently, $W_n^+ + W_n^- = 1$ and $W_n^- < \frac12$, $n = 0,1,\dots,N$. The values of $t_n$ at each step are chosen to minimize the probability of error of the final combined classifier. Remark 2.1 shows that with an optimal choice of $t_n$

$$w_{n+1}\{G_n\} = w_{n+1}\{B_n\} = \frac12.$$


2.2. AdaBoost map on the extended phase space. Boosting can be viewed as a discrete time dynamical system on the extended phase space $\mathcal{H} \times \mathcal{W}$, which is the direct product of the positive cone $\mathcal{H}$ and the simplex of probability measures $\mathcal{W} = \{w : \sum_{i=1}^m w(i) = 1,\ w(i) \ge 0\}$. The vector field $v(H,w)$ on $\mathcal{H} \times \mathcal{W}$ is a constant $h = h(w)$ on the "fibers" $\mathcal{H}_w = \{(H,w') : H \in \mathcal{H},\ w' = w\}$, i.e.

$$v(H,w) = h \quad \text{for any } H \in \mathcal{H}_w.$$

The AdaBoost dynamics maps $(H_n,w_n)$ into $(H_{n+1},w_{n+1})$ by the rule

$$w_{n+1}(i) = \frac{w_n(i)\,e^{-t_ny_iv(H_{n-1},w_n)(x_i)}}{Z_n}, \qquad n = 0,1,\dots;\ i = 1,2,\dots,m; \tag{2.2}$$

$$H_{n+1} = H_n + t_{n+1}\,v(H_n,w_{n+1}), \qquad n = -1,0,1,\dots. \tag{2.3}$$

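Along the trajectory $\{(H_n,w_n),\ n = 0,1,\dots,N\}$ the AdaBoost dynamics drives the error

$$W^-(H_n,w_0) + W^0(H_n,w_0) = \sum_{i=1}^m w_0(i)\,\chi_{[y_iH_n(x_i) \le 0]}(x_i)$$

to zero. The idea of the proof is to overestimate the function $W^-(\cdot,w_0) + W^0(\cdot,w_0)$, which has no gradient, by some smooth function $\mathcal{E}(\cdot,w_0)$ defined as

$$\mathcal{E}(H,w) = \sum_{i=1}^m w(i)\,e^{-y_iH(x_i)}.$$

The function $\mathcal{E}(\cdot,\cdot)$ is strictly convex in the first argument,

$$\mathcal{E}(\lambda H' + \mu H'',w) < \lambda\,\mathcal{E}(H',w) + \mu\,\mathcal{E}(H'',w),$$

and linear in the second. Clearly,

$$W^-(H,w_0) + W^0(H,w_0) = \sum_{i=1}^m w_0(i)\,\chi_{[y_iH(x_i) \le 0]}(x_i) \le \mathcal{E}(H,w_0) = \sum_{i=1}^m w_0(i)\,e^{-y_iH(x_i)}.$$

The point is that the function $\mathcal{E}(\cdot,w_0)$ plays the role of a Lyapunov function for the AdaBoost dynamics.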

Equations of the AdaBoost dynamics imply that the values of $\mathcal{E}(H_n,w_0)$ along the trajectory satisfy two equivalent identities. The first connects two consecutive values:

$$\mathcal{E}(H_{n+1},w_0) = Z_{n+1}\,\mathcal{E}(H_n,w_0). \tag{2.4}$$

The second identity reads

$$\mathcal{E}(H_n,w_0) = \prod_{p=0}^n Z_p. \tag{2.5}$$

The identities can be easily proved by means of the relation

$$\mathcal{E}(H_n,w_k) = Z_k \times \dots \times Z_n\,\mathcal{E}(H_{k-1},w_{n+1}), \qquad k \le n; \tag{2.6}$$

and the boundary condition $\mathcal{E}(H_{-1},w_n) = 1$, due to $H_{-1} = 0$. Both identities have their counterparts in the continuous time case.
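
The identity (2.5) can be checked numerically. The following is a minimal sketch, not part of the original argument: the toy labels, the table `h` of weak-classifier values and the `schedule` of classifier indices and steps are hypothetical, and the steps $t_n$ are deliberately arbitrary rather than optimal, since (2.5) holds for any positive $t_n$.

```python
import math

# Hypothetical toy data: labels y_i and values h[g][i] of three weak classifiers.
y = [+1, +1, -1, -1, +1]
h = [
    [+1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1],
    [-1, +1, -1, -1, +1],
]
m = len(y)
w0 = [1.0 / m] * m

def energy(H, w):
    """E(H, w) = sum_i w(i) exp(-y_i H(x_i)), with H stored by its values on the x_i."""
    return sum(wi * math.exp(-yi * Hi) for wi, yi, Hi in zip(w, y, H))

# Arbitrary (classifier index, step t_n) pairs; assumed for illustration only.
schedule = [(0, 0.3), (1, 0.7), (2, 0.2), (0, 0.5)]

H = [0.0] * m          # H_{-1} = 0
w = w0[:]              # w_0
Z_product = 1.0
for g, t in schedule:
    unnorm = [wi * math.exp(-t * yi * hi) for wi, yi, hi in zip(w, y, h[g])]
    Z = sum(unnorm)                                  # normalizer Z_n
    w = [u / Z for u in unnorm]                      # measure update (2.2)
    H = [Hi + t * hi for Hi, hi in zip(H, h[g])]     # classifier update (2.3)
    Z_product *= Z
    # identity (2.5): E(H_n, w_0) = Z_0 * ... * Z_n
    assert abs(energy(H, w0) - Z_product) < 1e-12
```

The check passes at every round, which reflects that $w_{n+1}$ is just $w_0$ reweighted by $e^{-y_iH_n(x_i)}$ and renormalized.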

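The constant $t_n > 0$ is chosen to minimize $Z_n$ on each step. In detail,

$$Z_n = W_n^+e^{-t_n} + W_n^-e^{t_n},$$

and from the condition of critical point $\frac{dZ_n}{dt_n} = 0$ we get an explicit formula

$$t_n = \frac12\log\frac{W_n^+}{W_n^-}.$$

The constant $t_n$ is positive if and only if $W_n^- < 1/2$. The formula for $Z_n$

$$Z_n = 2\sqrt{W_n^+W_n^-}$$

with optimal $t_n$ follows easily. If $W_n^- = \frac12 - \beta_n$, then

$$Z_n = \sqrt{1 - 4\beta_n^2} \le e^{-2\beta_n^2},$$

and

$$W^-(H_N,w_0) + W^0(H_N,w_0) \le \mathcal{E}(H_N,w_0) \le e^{-2\sum_{p=0}^N\beta_p^2}.$$

Therefore the training error decays exponentially with $N$ if the $\beta_n$ are uniformly bounded away from zero. Moreover, if $N$ is such that

$$\min_i w_0(i) > e^{-2\sum_{p=0}^N\beta_p^2},$$

then $W^-(H_N,w_0) + W^0(H_N,w_0) = 0$.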

Remark 2.1. Note that the formulas for $Z_n$ and $t_n$ imply

$$w_{n+1}\{B_n\} = \frac{e^{t_n}W_n^-}{Z_n} = \frac{e^{t_n}W_n^-}{2\sqrt{W_n^-W_n^+}} = \frac12.$$
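
As an illustration of the discrete algorithm with the optimal step, here is a small sketch under stated assumptions: the six-point data set and the three weak classifiers are hypothetical, and the classifier with the smallest weighted error is picked on every round (the text only requires error below 1/2). The code checks Remark 2.1 after each update and the exponential bound at the end.

```python
import math

# Hypothetical toy problem: labels and values of three weak classifiers.
y = [+1, +1, -1, -1, +1, -1]
h = [
    [+1, +1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1, -1],
    [+1, +1, -1, -1, -1, +1],
]
m = len(y)
w0 = [1.0 / m] * m

def weighted_error(hg, w):
    """W^-(h, w): mass of the points misclassified by h."""
    return sum(wi for wi, yi, hi in zip(w, y, hg) if yi * hi == -1)

w, H, beta_sq_sum = w0[:], [0.0] * m, 0.0
for n in range(5):
    g = min(range(len(h)), key=lambda k: weighted_error(h[k], w))
    W_minus = weighted_error(h[g], w)
    if W_minus >= 0.5 or W_minus == 0.0:
        break                                   # condition (2.1) fails or log is undefined
    W_plus = 1.0 - W_minus
    t = 0.5 * math.log(W_plus / W_minus)        # optimal step t_n
    Z = 2.0 * math.sqrt(W_plus * W_minus)       # optimal normalizer Z_n
    w = [wi * math.exp(-t * yi * hi) / Z for wi, yi, hi in zip(w, y, h[g])]
    H = [Hi + t * hi for Hi, hi in zip(H, h[g])]
    beta_sq_sum += (0.5 - W_minus) ** 2
    # Remark 2.1: the points just misclassified now carry mass 1/2
    assert abs(weighted_error(h[g], w) - 0.5) < 1e-12

energy = sum(wi * math.exp(-yi * Hi) for wi, yi, Hi in zip(w0, y, H))
train_error = sum(wi for wi, yi, Hi in zip(w0, y, H) if yi * Hi <= 0)
```

With these choices `train_error <= energy <= exp(-2 * beta_sq_sum)` holds, matching the chain of inequalities above.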

3. Continuous time AdaBoost algorithm. 

3.1. Differential equations for the AdaBoost flow. In this section we introduce a continuous time AdaBoost flow on $\mathcal{H} \times \mathcal{W}$. Namely, we construct a family of combined classifiers $H_t(x)$ and measures $w_t$ for all $t$, $0 \le t \le T$. Differential equations allow us to define the AdaBoost flow when the weak classifiers take arbitrary real values, i.e. we assume that any $h_\gamma \in \mathcal{H}_0$ is such that

$$h_\gamma : X \to \mathbb{R}.$$

Let $e_k(x)$, $k = 1,\dots,m$, be a basis in the space of all classifiers, meaning that the matrix $\|e_k(x_j)\|_{j,k=1,\dots,m}$ is of full rank $m$. Then,

$$\mathcal{H}_0 \subset \mathcal{H} \subset \operatorname{span}\{e_k,\ k = 1,\dots,m\}.$$

Therefore, any classifier $H_t$ can be written as

$$H_t(x) = \lambda_t^1e_1 + \dots + \lambda_t^me_m.$$

Let $\gamma_t : [0,\infty) \to \Gamma$ be a function with a finite number of values on any finite interval, and let it be continuous from the right with respect to the discrete topology on $\Gamma$. We choose a vector field constant on $\mathcal{H}_{w_t}$ as in the discrete case, i.e.

$$v(H,w_t) = h_{\gamma_t}$$

for any $H \in \mathcal{H}_{w_t}$, and write

$$v = v^1e_1 + v^2e_2 + \dots + v^me_m.$$

In this language the AdaBoost flow differential equations are the following:

$$\frac{d}{dt}\lambda_t^k = v^k(H_t,w_t), \qquad k = 1,2,\dots,m; \tag{3.1}$$

$$\frac{d}{dt}w_t(k) = -y_kv(H_t,w_t)(x_k)\,w_t(k) + \sigma_tw_t(k), \qquad k = 1,2,\dots,m; \tag{3.2}$$

where $\sigma_t = \sigma_{w_t} = \sum_{p=1}^m y_pv(H_t,w_t)(x_p)\,w_t(p)$. It can be checked easily that the quantity

$$w(1) + w(2) + \dots + w(m)$$

is an integral of motion. Therefore, the orbits of the AdaBoost flow remain on the simplex $\mathcal{W}$ for all times.

The solution of the differential equations with a fixed $v$ is a straight line motion

$$H_t = H_0 + t \times v(H_0,w_0);$$

and for the measure

$$w_t(k) = \frac{w_0(k)\,e^{-ty_kv(H_0,w_0)(x_k)}}{\sum_{p=1}^m w_0(p)\,e^{-ty_pv(H_0,w_0)(x_p)}}, \qquad k = 1,2,\dots,m.$$
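
The explicit solution can be compared with a direct numerical integration of (3.2). The sketch below is only an illustration: the margins `marg` (the numbers $y_kv(x_k)$ of one fixed classifier), the initial measure and the classical fourth-order Runge-Kutta stepper are all assumptions introduced here.

```python
import math

# Hypothetical fixed classifier: marg[k] = y_k * v(x_k), and an initial measure w_0.
marg = [+1.0, +1.0, -1.0, +1.0]
w0 = [0.1, 0.2, 0.3, 0.4]

def rhs(w):
    """Right-hand side of (3.2): dw(k)/dt = (-marg[k] + sigma) * w(k)."""
    sigma = sum(ak * wk for ak, wk in zip(marg, w))
    return [(-ak + sigma) * wk for ak, wk in zip(marg, w)]

def rk4_step(w, dt):
    """One classical Runge-Kutta step for the measure equation."""
    k1 = rhs(w)
    k2 = rhs([wi + 0.5 * dt * ki for wi, ki in zip(w, k1)])
    k3 = rhs([wi + 0.5 * dt * ki for wi, ki in zip(w, k2)])
    k4 = rhs([wi + dt * ki for wi, ki in zip(w, k3)])
    return [wi + dt / 6.0 * (a1 + 2 * a2 + 2 * a3 + a4)
            for wi, a1, a2, a3, a4 in zip(w, k1, k2, k3, k4)]

def closed_form(t):
    """Explicit solution: w_0(k) exp(-t * marg[k]), normalized."""
    unnorm = [wk * math.exp(-t * ak) for wk, ak in zip(w0, marg)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

T, steps = 1.0, 1000
w = w0[:]
for _ in range(steps):
    w = rk4_step(w, T / steps)

max_gap = max(abs(p - q) for p, q in zip(w, closed_form(T)))
total_mass = sum(w)      # integral of motion: stays equal to 1
```

The numerical orbit stays on the simplex and agrees with the closed form up to the integrator's accuracy.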

The equations for $w_t(k)$, $k = 1,\dots,m$, coincide with the equations for spectral weights in [B] and [TB]. In the case of the Toda lattice all the numbers $y_kv(H_t,w_t)(x_k)$ are distinct for different $k$. They are the simple spectrum of the Jacobi matrix. Here the situation is different. In the case of weak classifiers which take only the two values $+1$ and $-1$, the components $y_kv(H_t,w_t)(x_k)$ of the vector field also take only these two possible values.

Now we want to write the equations (3.2) in a different form. This new form will be used in section 3.7. The measure $w$ can be defined in terms of the potential function $f$ as $w_t(k) = e^{-f_t(k)}$, $k = 1,\dots,m$. Then we can write (3.2) as

$$\frac{d}{dt}f_t(k) = y_kv(H_t,w_t)(x_k) - \sigma_t, \qquad k = 1,2,\dots,m. \tag{3.3}$$

What are the orbits of the AdaBoost flow on the simplex $\mathcal{W}$? Let weak classifiers take values in $\{+1,-1\}$. Define

$$W^+ = w_0\{i : y_ih(x_i) = 1\}, \qquad W^- = w_0\{i : y_ih(x_i) = -1\} > 0;$$

and

$$U(t) = \frac{1}{W^+ + e^{2t}W^-}.$$

Lemma 3.1. Assume that the AdaBoost flow runs for all $t \ge 0$ with the fixed $v(H_t,w_t)$, i.e. it comes from a fixed single classifier $h$. Then for $t \ge 0$

$$w_t = \mathcal{L}[w_0] + \mathcal{D}[w_0]\,U(t),$$

where the vectors $\mathcal{L}[w_0]$ and $\mathcal{D}[w_0]$ are defined by the following formulas, for $i = 1,\dots,m$:

$$\mathcal{L}[w_0](i) = \begin{cases} 0, & y_ih(x_i) = 1,\\ \dfrac{w_0(i)}{W^-}, & y_ih(x_i) = -1; \end{cases} \qquad \mathcal{D}[w_0](i) = \begin{cases} w_0(i), & y_ih(x_i) = 1,\\ -\dfrac{W^+}{W^-}\,w_0(i), & y_ih(x_i) = -1. \end{cases}$$

Proof. Substitute the explicit expression for the flow into the formulas. □

Since $U(t) \in (0,1]$ for $t \ge 0$, the orbit of the point $w_0$ under the AdaBoost flow is a semi-interval between the points $w_0$ and $\mathcal{L}[w_0]$. Moreover, as $t \to +\infty$, $w_t \to \mathcal{L}[w_0]$, i.e. the AdaBoost flow transports the measure towards the points where the classifier makes an error.
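
Lemma 3.1 is easy to test numerically. The following sketch assumes a hypothetical five-point sample and one fixed classifier with values in $\{+1,-1\}$; it compares the explicit flow with the affine formula $\mathcal{L}[w_0] + \mathcal{D}[w_0]U(t)$ and checks the limit as $t$ grows.

```python
import math

# Hypothetical data: labels, values of one fixed classifier, initial measure.
y = [+1, +1, -1, -1, +1]
hx = [+1, -1, -1, +1, +1]
w0 = [0.30, 0.10, 0.20, 0.15, 0.25]
marg = [yi * hi for yi, hi in zip(y, hx)]          # y_i h(x_i) in {+1, -1}

W_plus = sum(wi for wi, a in zip(w0, marg) if a == 1)
W_minus = sum(wi for wi, a in zip(w0, marg) if a == -1)

# Vectors L[w_0] and D[w_0] from the lemma.
L = [0.0 if a == 1 else wi / W_minus for wi, a in zip(w0, marg)]
D = [wi if a == 1 else -(W_plus / W_minus) * wi for wi, a in zip(w0, marg)]

def U(t):
    return 1.0 / (W_plus + math.exp(2.0 * t) * W_minus)

def flow(t):
    """Explicit solution of the flow for a fixed classifier."""
    unnorm = [wi * math.exp(-t * a) for wi, a in zip(w0, marg)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

def lemma(t):
    return [Li + Di * U(t) for Li, Di in zip(L, D)]

gap = max(abs(p - q)
          for t in (0.0, 0.5, 2.0, 10.0)
          for p, q in zip(flow(t), lemma(t)))
limit_gap = max(abs(p - q) for p, q in zip(flow(20.0), L))
```

Both `gap` and `limit_gap` come out at the level of rounding error, so the orbit is indeed the segment from $w_0$ towards $\mathcal{L}[w_0]$.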

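Now let weak classifiers take values in $\{-1,0,+1\}$. Define:

$$W^0 = w_0\{i : h(x_i) = 0\}.$$

Let us assume $0 < W^0 < 1$. Define:

$$Z(t) = W^+e^{-t} + W^-e^{t} + W^0, \qquad \alpha(t) = \frac{e^{-t}}{Z(t)}, \qquad \beta(t) = \frac{1}{Z(t)}.$$

Lemma 3.2. For any $t \ge 0$,

$$w_t = \mathcal{L}[w_0] + \mathcal{D}^+[w_0]\,\alpha(t) + \mathcal{D}^0[w_0]\,\beta(t),$$

where the vectors $\mathcal{L}[w_0]$, $\mathcal{D}^+[w_0]$ and $\mathcal{D}^0[w_0]$ are defined, for $i = 1,\dots,m$, as:

$$\mathcal{L}[w_0](i) = \begin{cases} \dfrac{w_0(i)}{W^-}, & y_ih(x_i) = -1,\\ 0, & \text{otherwise}; \end{cases}$$

$$\mathcal{D}^+[w_0](i) = \begin{cases} w_0(i), & y_ih(x_i) = 1,\\ 0, & h(x_i) = 0,\\ -\dfrac{W^+}{W^-}\,w_0(i), & y_ih(x_i) = -1; \end{cases} \qquad \mathcal{D}^0[w_0](i) = \begin{cases} 0, & y_ih(x_i) = 1,\\ w_0(i), & h(x_i) = 0,\\ -\dfrac{W^0}{W^-}\,w_0(i), & y_ih(x_i) = -1. \end{cases}$$

Moreover, the functions $\alpha$ and $\beta$ satisfy the equation

$$a\alpha^2 + d\alpha\beta + b\beta^2 - \alpha = 0,$$

where $a = W^+$, $b = W^-$ and $d = W^0$.

Proof. The equations follow from the obvious relations

$$\frac{1}{\alpha} = Ze^{t}, \qquad \frac{1}{\beta} = Z, \qquad a\alpha + \frac{b}{Ze^{-t}} + d\beta = 1.$$

□

As in the first case, when the classifier takes only two values, $w_t \to \mathcal{L}[w_0]$ when $t \to +\infty$. In the present case with three values, the orbit of $w_0$ lies in a two dimensional plane on a second degree algebraic curve.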
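
The three-valued case can be checked in the same way. In the sketch below the sample is hypothetical, and the functions are taken to be $\alpha(t) = e^{-t}/Z(t)$ and $\beta(t) = 1/Z(t)$; this choice is an assumption made here because it is the one consistent with the quadratic relation of Lemma 3.2.

```python
import math

# Hypothetical data: margins y_i h(x_i) in {+1, 0, -1} and an initial measure.
marg = [+1, +1, 0, -1, -1, 0]
w0 = [0.20, 0.10, 0.15, 0.25, 0.10, 0.20]

a = sum(wi for wi, mi in zip(w0, marg) if mi == +1)    # W^+
b = sum(wi for wi, mi in zip(w0, marg) if mi == -1)    # W^-
d = sum(wi for wi, mi in zip(w0, marg) if mi == 0)     # W^0

L = [wi / b if mi == -1 else 0.0 for wi, mi in zip(w0, marg)]
D_plus = [wi if mi == +1 else (0.0 if mi == 0 else -(a / b) * wi)
          for wi, mi in zip(w0, marg)]
D_zero = [0.0 if mi == +1 else (wi if mi == 0 else -(d / b) * wi)
          for wi, mi in zip(w0, marg)]

def Z(t):
    return a * math.exp(-t) + b * math.exp(t) + d

def alpha(t):
    return math.exp(-t) / Z(t)      # assumed definition, see the lead-in

def beta(t):
    return 1.0 / Z(t)               # assumed definition, see the lead-in

def flow(t):
    return [wi * math.exp(-t * mi) / Z(t) for wi, mi in zip(w0, marg)]

def lemma(t):
    return [Li + Dp * alpha(t) + Dz * beta(t)
            for Li, Dp, Dz in zip(L, D_plus, D_zero)]

times = (0.0, 0.3, 1.0, 4.0)
gap = max(abs(p - q) for t in times for p, q in zip(flow(t), lemma(t)))
conic = max(abs(a * alpha(t) ** 2 + d * alpha(t) * beta(t)
                + b * beta(t) ** 2 - alpha(t)) for t in times)
```

Both `gap` and `conic` vanish up to rounding, so the orbit lies on a conic in the plane spanned by $\mathcal{D}^+[w_0]$ and $\mathcal{D}^0[w_0]$.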

Lemma 3.3. Assume that the AdaBoost flow runs on the infinite time interval with the same $v(H_t,w_t)$, i.e. it comes from the same classifier. Let

$$V_{\max} = \max_k y_kv(H_t,w_t)(x_k) \quad \text{and} \quad V_{\min} = \min_k y_kv(H_t,w_t)(x_k).$$

Then $\frac{d}{dt}\sigma(t) < 0$ and $\lim_{t \to -\infty}\sigma(t) = V_{\max}$, $\lim_{t \to +\infty}\sigma(t) = V_{\min}$.

Proof. By Jensen's inequality and the strict convexity of the function $x^2$, we get

$$\frac{d}{dt}\sigma(t) = \sum_{p=1}^m y_pv(H_t,w_t)(x_p)\big[-y_pv(H_t,w_t)(x_p)\,w_t(p) + \sigma_tw_t(p)\big] = -\sum_{p=1}^m \big[y_pv(H_t,w_t)(x_p)\big]^2w_t(p) + \sigma_t^2 < 0.$$

The rest can be proved easily. □
The next result for the Lyapunov function is central for our discussion. This identity is a continuum analog of (2.6).

Theorem 3.4. If $p \le t$, then

$$\log\mathcal{E}(H_t,w_p) - \log\mathcal{E}(H_p,w_t) = -\int_p^t\sigma_s\,ds.$$

Proof. Assume that the AdaBoost flow runs on the time interval $[p,t]$ with the same $v(H_s,w_s)$, i.e. it comes from the same classifier $h_\gamma$. In general, the interval $[p,t]$ can be split into subintervals with this property. The differential equations imply two identities,

$$H_t = H_p + \int_p^t v(H_s,w_s)\,ds$$

on such subintervals, and for any $k = 1,\dots,m$

$$w_t(k) = w_p(k)\,e^{-y_k\int_p^t v(H_s,w_s)(x_k)\,ds}\,e^{\int_p^t\sigma_s\,ds}.$$

Therefore,

$$\mathcal{E}(H_t,w_p) = \sum_k w_p(k)\,e^{-y_kH_t(x_k)} = \sum_k w_p(k)\,e^{-y_k\int_p^t v(H_s,w_s)(x_k)\,ds}\,e^{-y_kH_p(x_k)} = \sum_k w_t(k)\,e^{-\int_p^t\sigma_s\,ds}\,e^{-y_kH_p(x_k)} = \mathcal{E}(H_p,w_t)\,e^{-\int_p^t\sigma_s\,ds}.$$

□




Using the fact that $\mathcal{E}(H_0,w_p) = 1$, we obtain the analog of (2.4),

$$\frac{d}{dt}\log\mathcal{E}(H_t,w_0) = -\sigma_t, \tag{3.4}$$

and the analog of (2.5),

$$\mathcal{E}(H_T,w_0) = e^{-\int_0^T\sigma_t\,dt}. \tag{3.5}$$

The last identity implies that one should try to choose $\gamma_t$ so that $\sigma_t$ is maximal along the path. In fact, there are a few choices. As will be shown below, they correspond to the discrete AdaBoost algorithm, the arc-gv algorithm and AdaBoost with varying confidence level.
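
Identity (3.5) can be illustrated for a piecewise constant control. The sketch below is a numerical check under assumptions introduced here: a hypothetical five-point sample, two weak classifiers, a hand-picked switching schedule `pieces`, and a midpoint rule for the integral of $\sigma_s$.

```python
import math

# Hypothetical data and control: (classifier index, duration) on each interval.
y = [+1, +1, -1, -1, +1]
h = [
    [+1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1],
]
w0 = [0.2] * 5
pieces = [(0, 0.4), (1, 0.7), (0, 0.3)]

def evolve(w, marg, s):
    """Flow for time s with a fixed classifier whose margins are marg."""
    unnorm = [wk * math.exp(-s * ak) for wk, ak in zip(w, marg)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

w, H, integral = w0[:], [0.0] * 5, 0.0
n_sub = 2000
for g, duration in pieces:
    marg = [yk * hk for yk, hk in zip(y, h[g])]
    ds = duration / n_sub
    for j in range(n_sub):
        ws = evolve(w, marg, (j + 0.5) * ds)                        # measure at the midpoint
        integral += sum(ak * wk for ak, wk in zip(marg, ws)) * ds   # sigma_s ds
    w = evolve(w, marg, duration)
    H = [Hk + duration * hk for Hk, hk in zip(H, h[g])]

energy = sum(wk * math.exp(-yk * Hk) for wk, yk, Hk in zip(w0, y, H))
predicted = math.exp(-integral)      # right-hand side of (3.5)
```

Here `energy` and `predicted` agree up to the quadrature error, whatever switching schedule is used.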

3.2. Entropy for the AdaBoost flow. 

Theorem 3.5. When $p$ is less than $t$, let the AdaBoost flow run with a fixed weak classifier. For the relative entropy of $w_t$ with respect to $w_p$ the following identity holds:

$$D(w_p\,\|\,w_t) = \sum_{i=1}^m w_p(i)\log\frac{w_p(i)}{w_t(i)} = -\int_p^t(\sigma_s - \sigma_p)\,ds.$$

Proof. Differentiating the formula for relative entropy

$$D(w_p\,\|\,w_t) = \sum_{i=1}^m w_p(i)\log w_p(i) - \sum_{i=1}^m w_p(i)\log w_t(i),$$

and using the AdaBoost flow equations one finds

$$\frac{d}{dt}D(w_p\,\|\,w_t) = -\sum_{i=1}^m w_p(i)\,\frac{1}{w_t(i)}\frac{dw_t(i)}{dt} = \sigma_p - \sigma_t.$$

Integrating both parts we obtain the stated formula. □

We can rewrite the identity of the theorem as

$$\int_p^t\sigma_s\,ds = -D(w_p\,\|\,w_t) + \sigma_p(t - p).$$

In the language of large deviation theory, see [H], the integral above is nothing but the rate function. Theorem 3.5 states that the rate function can be expressed in terms of relative entropy, or Kullback-Leibler distance. This is a common fact in large deviation theory.
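
The entropy identity of Theorem 3.5 can be verified numerically for a fixed classifier. The sketch below assumes hypothetical real-valued margins, an arbitrary initial measure, and a midpoint rule for the integral; none of these come from the text.

```python
import math

# Hypothetical fixed classifier (margins y_k v(x_k)) and initial measure.
marg = [1.0, -1.0, 0.5, -0.25, 1.0]
w0 = [0.1, 0.3, 0.2, 0.25, 0.15]
p, t = 0.3, 1.2

def flow(s):
    unnorm = [wk * math.exp(-s * ak) for wk, ak in zip(w0, marg)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

def sigma(s):
    return sum(ak * wk for ak, wk in zip(marg, flow(s)))

w_p, w_t = flow(p), flow(t)
relative_entropy = sum(a * math.log(a / b) for a, b in zip(w_p, w_t))

# Midpoint rule for the integral of (sigma_s - sigma_p) over [p, t].
n_sub = 4000
ds = (t - p) / n_sub
integral = sum((sigma(p + (j + 0.5) * ds) - sigma(p)) * ds for j in range(n_sub))
```

The value of `relative_entropy` matches `-integral`; it is nonnegative because $\sigma_s$ decreases along the flow.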

3.3. Embedding of the AdaBoost map into the AdaBoost flow. In this section we assume that all weak classifiers $h_\gamma$, $\gamma \in \Gamma$, take only the two values $-c$ and $+c$, $c > 0$. The formulas obtained in this section will be generalized to the case of classifiers with varying confidence level. For each classifier, we define

$$W^- = w_0\{i : y_ih_\gamma(x_i) = -c\}, \qquad W^+ = w_0\{i : y_ih_\gamma(x_i) = +c\}.$$

We also assume that $1/2 < W^+ < 1$.

Theorem 3.6. Let the AdaBoost flow run up to time $\Delta$ with fixed $v(H_t,w_t)$, i.e. it comes from a fixed classifier $h_\gamma$. Then,

(i) For any $\Delta \ge 0$,

$$e^{-\int_0^\Delta\sigma_s\,ds} \ge 2\sqrt{W^+W^-}. \tag{3.6}$$

(ii) The equality in (3.6) holds if and only if

$$\Delta = \frac{1}{2c}\log\frac{W^+}{W^-}. \tag{3.7}$$

(iii) The equality in (3.6) holds for some $\Delta > 0$ if and only if $\sigma_\Delta = 0$.

Proof. (i) By (3.5),

$$e^{-\int_0^\Delta\sigma_s\,ds} = \sum_{k=1}^m w_0(k)\,e^{-\Delta y_kh_\gamma(x_k)} = W^+e^{-c\Delta} + W^-e^{c\Delta}.$$

Inequality (3.6) follows from the inequality between the arithmetic and geometric means.

(ii) Write

$$Z(\Delta,c) = e^{-\int_0^\Delta\sigma_s\,ds}.$$

If the left hand side of (3.6) attains its minimum and equality holds, then

$$\frac{dZ}{d\Delta} = 0.$$

This implies the formula (3.7). The converse statement can be checked directly.

(iii) The condition $\sigma_\Delta = 0$ means that

$$-\frac{\sigma_\Delta Z(\Delta,c)}{c} = W^-e^{c\Delta} - W^+e^{-c\Delta} = 0.$$

This is equivalent to (3.7) and so to equality in (3.6) as well. □
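
A quick numerical illustration of Theorem 3.6, with hypothetical values of $c$ and $W^+$ chosen here: the function $Z(\Delta,c) = W^+e^{-c\Delta} + W^-e^{c\Delta}$ stays above $2\sqrt{W^+W^-}$, touches it at the time given by (3.7), and $\sigma_\Delta$ vanishes there.

```python
import math

c, W_plus = 0.5, 0.8          # hypothetical confidence level and mass of correct points
W_minus = 1.0 - W_plus

def Z(delta):
    """Z(Delta, c) = exp(-int_0^Delta sigma_s ds) for a fixed classifier."""
    return W_plus * math.exp(-c * delta) + W_minus * math.exp(c * delta)

def sigma(delta):
    """sigma at time Delta: the w_Delta-average of the margins +c and -c."""
    return c * (W_plus * math.exp(-c * delta) - W_minus * math.exp(c * delta)) / Z(delta)

delta_star = math.log(W_plus / W_minus) / (2.0 * c)     # formula (3.7)
bound = 2.0 * math.sqrt(W_plus * W_minus)               # right-hand side of (3.6)

grid = [0.01 * k for k in range(1, 1001)]
lowest_on_grid = min(Z(delta) for delta in grid)
```

On the grid the minimum of `Z` never drops below `bound`, while `Z(delta_star)` equals `bound` and `sigma(delta_star)` is zero up to rounding.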
To explain the connection between the continuous time system and the AdaBoost algorithm we assume that $c = 1$. One can define the values of the control $\gamma_t : [0,+\infty) \to \Gamma$ recurrently by the following procedure. Given $w_0$ and for $t_{-1} = 0$, define

$$\gamma_0 = \gamma_{t_{-1}} = \arg\min_\gamma W^-(h_\gamma,w_0). \tag{3.8}$$

Suppose that $W^- = W^-(h_{\gamma_0},w_0) < 1/2$. Then $\sigma_0 = W^+ - W^- > 0$ decays with time. The AdaBoost flow runs with this $\gamma_0$ until the time $t_0 = \Delta_0$, which is determined from (3.7). Note that $\sigma_{t_0} = 0$. Therefore, we define $\gamma_t = \gamma_0$ for $t \in [0,t_0)$. It is interesting to check that

$$w_{t_0}\{i : y_ih_{\gamma_0}(x_i) < 0\} = \frac12.$$

For the next step one should look for a new solution of the minimization problem (3.8) with $w_0$ replaced by $w_{t_0}$, etc.

In general, put $t_n = \sum_{p=0}^n\Delta_p$, $n \ge 1$. The corresponding control and errors are

$$\gamma_{t_{n-1}} = \arg\min_\gamma W^-(h_\gamma,w_{t_{n-1}}),$$

and $W^- = W^-(h_{\gamma_{t_{n-1}}},w_{t_{n-1}}) < 1/2$. The intervals $\Delta_n$ are determined from (3.7):

$$\Delta_n = \frac{1}{2c}\log\frac{W^+}{W^-} = \frac12\log\frac{1 + 2\beta_{n-1}}{1 - 2\beta_{n-1}},$$

where $\beta_{n-1} = \frac12 - W^-$. It is easy to see that the sequence $(H_n,w_n) = (H_{t_n},w_{t_n})$ is also a trajectory of the discrete AdaBoost algorithm.


3.4. Embedding of the arc-gv algorithm into the AdaBoost flow. First we formulate a version of the discrete algorithm, see [2]. Assume that we have a classifier

$$H = \sum_{k=0}^n t_kh_k,$$

where $t_k \ge 0$ and $h_k \in \mathcal{H}_0$ for $k = 0,\dots,n$. Introduce the norm

$$|H| = \sum_{k=0}^n t_k,$$

together with the normalized margin of the function $H$ at the point $(x,y) \in X \times \{-1,+1\}$,

$$m(x,y;H) = \frac{yH(x)}{|H|},$$

and the minimal margin

$$\mu(H) = \min_{(x,y) \in TS}\{m(x,y;H)\}.$$

Let $\mu(0)$ be $-1$ and note the obvious properties of $\mu(H)$:

• $-1 \le \mu(H) \le 1$.

• $\mu(H) = -1$ if and only if there exists $(x,y) \in TS$ such that $h_k(x) \ne y$ for all $k = 0,\dots,n$, i.e. there is a point on which all weak classifiers constituting $H$ make an error, or if $H = 0$.

• Assume that $\mu(H) = 1$; then for all $(x,y) \in TS$ one has $h_k(x) = y$, where $k = 0,\dots,n$. In other words, all weak classifiers are not weak, but each of them is able to separate points without error. We assume that there are no such classifiers at all, so that $\mu(H) < 1$.

• $\mu(H) > 0$ if and only if for all $(x,y) \in TS$ one has $yH(x) > 0$, i.e. all points are classified correctly by the function $H$.

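Now we describe the arc-gv algorithm itself.

Initialization:

• $H_{-1} = 0$,

• $w_0(i) = \frac{1}{m}$, $i = 1,\dots,m$,

• $\tau \gg 1$, the regularization parameter (a large positive number).

For $n = 0,1,\dots$:

• Choose a weak classifier $h_{\gamma_n} \in \mathcal{H}_0$ such that $W^-(h_{\gamma_n},w_n) < \frac12$;

• $\beta_n = \frac12 - W^-(h_{\gamma_n},w_n)$;

• Determine the weight: $t_n = \min\Big\{\tau,\ \frac12\ln\Big(\frac{1+2\beta_n}{1-2\beta_n}\Big) - \frac12\ln\Big(\frac{1+\mu_{n-1}}{1-\mu_{n-1}}\Big)\Big\}$, where $\mu_{n-1} = \mu(H_{n-1})$;

• If $t_n \le 0$, then the algorithm stops;

• Update the measure: $w_{n+1}(i) = \frac{1}{Z_n}\exp(-t_ny_ih_{\gamma_n}(x_i))\,w_n(i)$, where $Z_n$ is the normalizing constant.
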

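The formula for the weight $t_n$ appears from the following optimization problem, [2]: to minimize in $t \in [0,\tau]$ the function

$$t \mapsto \sum_{i=1}^m w_n(i)\exp\big(-t\,(y_ih_{\gamma_n}(x_i) - \mu_{n-1})\big).$$

As for AdaBoost, one finds an exact formula for the optimal $t$ by differentiation. Moreover, for $\mu_{n-1} \ne \pm1$ and $t_n < \tau$:

$$w_{n+1}\{i : h_{\gamma_n}(x_i) \ne y_i\} = \frac{1 - \mu_{n-1}}{2}.$$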
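
The arc-gv steps listed above can be sketched in a few lines. This is an illustration under assumptions made here, not Breiman's reference implementation: the toy data and the three classifiers are hypothetical, the classifier with the smallest weighted error is chosen on every round, $\tau = 2$ is an arbitrary cap, and the mass identity asserted inside the loop is the one obtained by substituting the unclipped weight formula into the measure update.

```python
import math

# Hypothetical toy problem with classifiers valued in {-1, +1}.
y = [+1, +1, -1, -1, +1, -1]
h = [
    [+1, +1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1, -1],
    [+1, +1, -1, -1, -1, +1],
]
m = len(y)
tau = 2.0                                   # regularization cap, chosen arbitrarily

def weighted_error(hg, w):
    return sum(wi for wi, yi, hi in zip(w, y, hg) if yi * hi == -1)

w = [1.0 / m] * m
H = [0.0] * m                               # H_{-1} = 0
norm = 0.0                                  # |H| = sum of the weights t_k
unclipped_checks = 0
for n in range(6):
    g = min(range(len(h)), key=lambda k: weighted_error(h[k], w))
    W_minus = weighted_error(h[g], w)
    # minimal normalized margin of the current H, with mu(0) = -1
    mu = -1.0 if norm == 0.0 else min(yi * Hi for yi, Hi in zip(y, H)) / norm
    if W_minus >= 0.5 or mu >= 1.0:
        break
    beta = 0.5 - W_minus
    ada = math.inf if W_minus == 0.0 else 0.5 * math.log((1 + 2 * beta) / (1 - 2 * beta))
    shift = -math.inf if mu <= -1.0 else 0.5 * math.log((1 + mu) / (1 - mu))
    t = min(tau, ada - shift)
    if t <= 0:
        break                               # stopping rule of the algorithm
    unnorm = [wi * math.exp(-t * yi * hi) for wi, yi, hi in zip(w, y, h[g])]
    Z = sum(unnorm)
    w = [u / Z for u in unnorm]
    H = [Hi + t * hi for Hi, hi in zip(H, h[g])]
    norm += t
    if t < tau:
        # unclipped step: the misclassified points now carry mass (1 - mu) / 2
        assert abs(weighted_error(h[g], w) - (1 - mu) / 2) < 1e-9
        unclipped_checks += 1
```

On this data the first rounds are clipped at $\tau$ because the minimal margin is still $-1$ or the AdaBoost step exceeds the cap; once the margin becomes positive the steps are shorter than the AdaBoost steps.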

The embedding of arc-gv into the AdaBoost flow is the same as for the discrete AdaBoost algorithm. Note that

$$\mu_t = \mu(H_t) = \frac{1}{t}\min_{(x,y) \in TS}\{yH_t(x)\}.$$

The formulas for the embedding are similar:

$$\Delta_n = \min\Big\{\Delta,\ \frac12\ln\Big(\frac{1+2\beta_n}{1-2\beta_n}\Big) - \frac12\ln\Big(\frac{1+\mu_{t_{n-1}}}{1-\mu_{t_{n-1}}}\Big)\Big\}, \qquad n \ge 1,$$

where $\Delta$ is a large fixed number. If at some moment $\Delta_n \le 0$, then the algorithm stops.

The general picture is as follows. At the beginning, when $\mu_t = -1$, we switch classifiers at equal intervals $\Delta$. Then $\mu_t > -1 + \epsilon$ and the algorithm starts to switch at intervals smaller than $\Delta$, but bigger than those prescribed by AdaBoost. That happens while $\mu_t \le 0$. At some moment $\mu_t = 0$, after which the constructed classifier $H_t$ has learned how to separate points without error. Finally, as a protection from overfitting, the algorithm stops when $\mu_t \ge 2\beta_n$.

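3.5. Classification with varying confidence level. In this section we will show how the confidence rated prediction of Schapire and Singer, [10], can be embedded into the AdaBoost flow. Let the set of all values of $h_\gamma$ be $c_j$, $j = 1,\dots,p$, and take

$$W^{+,j} = w_0\{i : h_\gamma(x_i) = c_j,\ y_i = +1\},$$

and

$$W^{-,j} = w_0\{i : h_\gamma(x_i) = c_j,\ y_i = -1\}.$$

Theorem 3.7. Fix some $\Delta > 0$. Let the AdaBoost flow run up to time $t = \Delta$ with fixed $v(H_s,w_s)$, i.e. it comes from the fixed classifier $h_\gamma$.

(i) Let $W^{+,j}W^{-,j} > 0$ for all $j = 1,\dots,p$; then for any $c_j$

$$e^{-\int_0^\Delta\sigma_s\,ds} \ge \sum_{j=1}^p 2\sqrt{W^{+,j}W^{-,j}}. \tag{3.9}$$

(ii) Let $W^{+,j}W^{-,j} > 0$ for all $j = 1,\dots,p$; then the equality holds in (3.9) if and only if

$$c_j = \frac{1}{2\Delta}\log\frac{W^{+,j}}{W^{-,j}}, \qquad j = 1,\dots,p;$$

in which case $\sigma_\Delta = 0$.

(iii) Let $W^{+,j}W^{-,j} > 0$ for all $j = 1,\dots,p'$ and $W^{+,j}W^{-,j} = 0$ for all $j = p'+1,\dots,p$, and let

$$c_j = \frac{1}{2\Delta}\log\frac{W^{+,j}}{W^{-,j}}, \qquad j = 1,\dots,p';$$

then for any $\epsilon > 0$

$$e^{-\int_0^\Delta\sigma_s\,ds} \le \sum_{j=1}^{p'} 2\sqrt{W^{+,j}W^{-,j}} + \epsilon,$$

by an appropriate choice of $c_j$ for all $j = p'+1,\dots,p$.

Proof. (i) It can be verified directly that

$$\int_0^\Delta\sigma_s\,ds = -\log\sum_{k=1}^m w_0(k)\,e^{-\Delta y_kh_\gamma(x_k)}.$$

Therefore,

$$e^{-\int_0^\Delta\sigma_s\,ds} = \sum_{k=1}^m w_0(k)\,e^{-\Delta y_kh_\gamma(x_k)} = \sum_{j=1}^p\big(W^{+,j}e^{-\Delta c_j} + W^{-,j}e^{\Delta c_j}\big).$$

Inequality (3.9) follows from the inequality between arithmetic and geometric means. Parts (ii) and (iii) follow from this formula similarly to the proof of Theorem 3.6. □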

The theorem suggests the following procedure. We put $\Delta_p = 1$ for all $p = 0,1,2,\dots$. On each round of the boosting procedure, we pick $h_\gamma$, $\gamma \in \Gamma$, such that the corresponding sum

$$Z = \sum_{j=1}^p 2\sqrt{W^{+,j}W^{-,j}}$$

is minimal over the set of all weak classifiers. By adjusting the values of $c_j$ according to the formulas of Theorem 3.7 we minimize the penalty function $\mathcal{E}$ on this round in an optimal way.

Let us give some explanation of the square roots which appear in the formula for $Z$. The set of all values of $h_\gamma$ is a finite set $c_j$, $j = 1,\dots,p$. Therefore, we have two special points of the $p-1$ dimensional simplex of probability measures,

$$p^+ = \Big(\frac{W^{+,1}}{W^+},\dots,\frac{W^{+,p}}{W^+}\Big)$$

and

$$p^- = \Big(\frac{W^{-,1}}{W^-},\dots,\frac{W^{-,p}}{W^-}\Big),$$

where $W^\pm = \sum_{j=1}^p W^{\pm,j}$. It is apparent that

$$Z = \sum_{j=1}^p 2\sqrt{W^{+,j}W^{-,j}} = 2\sqrt{W^+W^-}\;BC(p^+,p^-),$$

where $BC(p,q) = \sum_j\sqrt{p_jq_j}$ is the Bhattacharyya coefficient, [3], the standard measure of separability of classes in classification.
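
The choice of confidence levels in Theorem 3.7 and the Bhattacharyya form of $Z$ can be illustrated as follows. The sketch assumes a hypothetical classifier that partitions eight training points into three blocks and outputs the value $c_j$ on block $j$; the weights and labels are made up for the example.

```python
import math

# Hypothetical partition classifier: block[i] is the index j with h(x_i) = c_j.
block = [0, 0, 0, 1, 1, 2, 2, 2]
y = [+1, +1, -1, +1, -1, +1, -1, +1]
w0 = [0.10, 0.20, 0.05, 0.15, 0.10, 0.10, 0.20, 0.10]
p, delta = 3, 1.0

W_pos = [sum(wi for wi, bi, yi in zip(w0, block, y) if bi == j and yi == +1) for j in range(p)]
W_neg = [sum(wi for wi, bi, yi in zip(w0, block, y) if bi == j and yi == -1) for j in range(p)]

# Optimal confidence levels from part (ii) of Theorem 3.7.
c_opt = [math.log(W_pos[j] / W_neg[j]) / (2.0 * delta) for j in range(p)]

def energy(c):
    """exp(-int_0^Delta sigma_s ds) = sum_i w_0(i) exp(-Delta y_i h(x_i))."""
    return sum(wi * math.exp(-delta * yi * c[bi]) for wi, bi, yi in zip(w0, block, y))

Z = sum(2.0 * math.sqrt(W_pos[j] * W_neg[j]) for j in range(p))

# Bhattacharyya form: Z = 2 sqrt(W+ W-) * BC(p+, p-).
W_plus, W_minus = sum(W_pos), sum(W_neg)
bc = sum(math.sqrt((W_pos[j] / W_plus) * (W_neg[j] / W_minus)) for j in range(p))
Z_bc = 2.0 * math.sqrt(W_plus * W_minus) * bc

perturbed = [cj + 0.1 for cj in c_opt]      # any other choice can only increase the energy
```

With `c_opt` the energy equals `Z`, the perturbed levels give a strictly larger value, and `Z_bc` reproduces `Z`.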

3.6. SuperBoost algorithm. In this section we want to introduce a new SuperBoost algorithm motivated by our continuous time considerations. It is a greedy algorithm which for each moment of time $t \ge 0$ chooses a weak classifier $h$ with the largest $\sigma_t(h)$. It would drive the error of classification to zero with the fastest possible rate.

Initialization:

• $H_0 = 0$,

• $w_0(i) = \frac{1}{m}$, $i = 1,\dots,m$,

• Choose a weak classifier $h_{\gamma_0} \in \mathcal{H}_0$ such that $\sigma_{w_0}(h_{\gamma_0}) = \max_{h \in \mathcal{H}_0}\sigma_{w_0}(h)$.

Updates occur on each infinitesimal step $t \to t + dt$:

• Change the classifier $h_{\gamma_t} = h$ to a new one $h_{\gamma_{t+dt}} = h'$ if

$$\sigma_t(h) = \sigma_t(h') \quad \text{and} \quad \frac{d}{dt}\sigma_t(h) < \frac{d}{dt}\sigma_t(h').$$

• Update the measure: $w_{t+dt}(i) = \frac{1}{Z}\exp(-dt\,y_ih_{\gamma_t}(x_i))\,w_t(i)$;

• Update the resulting classifier: $H_{t+dt} = H_t + dt \times h_{\gamma_t}$.

It can be proved that the SuperBoost algorithm updates the weak classifier only a finite number of times on each finite time interval $[0,T]$.
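
A crude time discretization gives a feeling for SuperBoost. The sketch below is a simplification made here, not the continuous-time algorithm itself: the infinitesimal step is replaced by a small fixed `dt`, and the switching rule is replaced by taking, at every step, the classifier with the largest current $\sigma_t(h)$. The toy data and classifiers are hypothetical.

```python
import math

# Hypothetical toy problem with classifiers valued in {-1, +1}.
y = [+1, +1, -1, -1, +1, -1]
h = [
    [+1, +1, +1, -1, +1, -1],
    [+1, -1, -1, -1, +1, -1],
    [+1, +1, -1, -1, -1, +1],
]
m = len(y)
dt, steps = 1e-3, 2000                       # total time T = steps * dt = 2

w = [1.0 / m] * m
H = [0.0] * m
integral = 0.0                               # Riemann sum for the integral of sigma_t
current, switches = None, 0
for _ in range(steps):
    sig = [sum(yi * hi * wi for yi, hi, wi in zip(y, hg, w)) for hg in h]
    g = max(range(len(h)), key=lambda k: sig[k])      # greedy choice of classifier
    if current is not None and g != current:
        switches += 1
    current = g
    integral += sig[g] * dt
    unnorm = [wi * math.exp(-dt * yi * hi) for wi, yi, hi in zip(w, y, h[g])]
    Z = sum(unnorm)
    w = [u / Z for u in unnorm]                        # measure update
    H = [Hi + dt * hi for Hi, hi in zip(H, h[g])]      # classifier update

energy = sum(math.exp(-yi * Hi) / m for yi, Hi in zip(y, H))
```

Up to an error of order `dt`, `log(energy)` equals `-integral`, in agreement with (3.5). The discretized rule may switch more often than the exact algorithm, so the `switches` counter is only indicative.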

3.7. Boosting and Perelman's ideas for the Ricci flow. This section is the 
most speculative part of our work. In our notations we follow pTj and [13]. Here 
we address a striking similarity between AdaBoost flow and Perelman's ideas, [7], 
to control the Ricci flow 

$$\frac{\partial}{\partial t}g_t = -2\,\mathrm{Ric}_{g_t}, \tag{3.10}$$

where $\mathrm{Ric}_g$ is the Ricci tensor of the metric $g$ and $g \in \mathcal{M}$, the space of metrics on a Riemannian manifold $M$. The equation describes some optimization procedure in the space $\mathcal{M}$. Perelman extends that phase space. Namely, he defines a Gibbsian type measure $dw$ on $M$ as

$$dw = e^{-f}\,dV_g,$$

where $dV_g$ is the volume element constructed from the metric $g$. Apparently, for given $g$ the measure $dw$ can be identified with the potential function $f$. Now the flow is defined on the extended phase space $\mathcal{M} \times C^\infty$. In order to control singularities of the Ricci flow $g_t$, $t \ge 0$, Perelman chooses the potential function $f$ in a special way determined by the dynamics of the metric. The system of coupled equations for the metric $g$ and potential function $f$,

$$\frac{\partial}{\partial t}g_t = -2\big(\mathrm{Ric}_{g_t} + \mathrm{Hess}_{g_t}f_t\big), \tag{3.11}$$

$$\frac{\partial}{\partial t}f_t = -R_{g_t} - \Delta f_t, \tag{3.12}$$

where $R_g$ is the scalar curvature of the metric $g$, leads to

$$\frac{\partial}{\partial t}(dw) = \frac{\partial}{\partial t}\big(e^{-f_t}\,dV_{g_t}\big) = 0.$$

The flow defined by (3.11) is the original Ricci flow (3.10) up to a time dependent diffeomorphism.

These equations are analogs of the AdaBoost flow equations (3.1) and (3.2). To be precise, equation (3.12) is an analog of (3.3), which is an equivalent form of (3.2). Moreover, these equations are similar termwise. The scalar curvature $R_{g_t}$ in (3.12) plays a role similar to that of the margin $y_kv(H_t,w_t)(x_k)$ in (3.3). The Laplacian $\Delta f_t$ in (3.12) is similar to the term $\sigma_t$ in (3.3).

On the extended phase space $\mathcal{M} \times C^\infty$ Perelman defines the following functional:

$$\mathcal{F}(g,f) = \int_M\big(R_g + |\nabla f|^2\big)\,e^{-f}\,dV_g.$$

Perelman calls the functional $\mathcal{F}(g,f)$ the entropy for the Ricci flow. The functional $\mathcal{F}(g,f)$ increases along trajectories of the Ricci flow. Indeed, the formula

$$\frac{d}{dt}\mathcal{F}(g_t,f_t) = \int_M\Big\langle -\mathrm{Ric}_{g_t} - \mathrm{Hess}_{g_t}f_t,\ \frac{\partial g_t}{\partial t}\Big\rangle\,e^{-f_t}\,dV_{g_t},$$

together with equations (3.11) and (3.12), leads to

$$\frac{d}{dt}\mathcal{F}(g_t,f_t) = 2\int_M\big|\mathrm{Ric}_{g_t} + \mathrm{Hess}_{g_t}f_t\big|^2\,e^{-f_t}\,dV_{g_t} \ge 0. \tag{3.13}$$

The functional $\mathcal{F}(g,f)$ is an analog of the Lyapunov function $\mathcal{E}(H,w)$. As we saw in section 3.2, the functional $\mathcal{E}(H,w)$ for the AdaBoost flow is closely connected to the ordinary Kullback-Leibler entropy. It steadily decreases along trajectories of the AdaBoost flow.

It is time to take stock of these similarities. The dictionary between the two problems is below.

AdaBoost flow | Ricci flow
$TS$, training set | $M$, Riemannian manifold
$\mathcal{H}$, cone over the set of classifiers | $\mathcal{M}$, space of Riemannian metrics
$\mathcal{H} \times \mathcal{W}$, phase space of the AdaBoost flow | $\mathcal{M} \times C^\infty$, phase space of the controlled Ricci flow
$\frac{d}{dt}\lambda_t^k = v^k(H_t,w_t)$ | $\frac{\partial}{\partial t}g_t = -2(\mathrm{Ric}_{g_t} + \mathrm{Hess}_{g_t}f_t)$
$\frac{d}{dt}f_t(k) = y_kv(H_t,w_t)(x_k) - \sigma_t$ | $\frac{\partial}{\partial t}f_t = -R_{g_t} - \Delta f_t$
$\mathcal{E}(H,w)$ | $\mathcal{F}(g,f)$
$\frac{d}{dt}\log\mathcal{E}(H_t,w_0) = -\sigma_t$ | $\frac{d}{dt}\mathcal{F}(g_t,f_t) \ge 0$